System and method for learning rankings via convex hull separation

ABSTRACT

A method for finding a ranking function ƒ that classifies feature points in an n-dimensional space includes providing a plurality of feature points x k  derived from tissue sample regions in a digital medical image, providing training data A comprising training samples A j  where  
         A   =       ⋃     j   =   1     S     ⁢     (       A   j     =       {     x   i   j     }       i   =   1       m   j         )         ,       
 
providing an ordering E={(P,Q)|A P   A Q } of at least some training data sets where all training samples x i εA P  are ranked higher than any sample x j εA Q , solving a mathematical optimization program to determine the ranking function ƒ that classifies said feature points x into sets A. For any two sets A i , A j , A i   A j , and the ranking function ƒ satisfies inequality constraints ƒ(x i )≦ƒ(x j ) for all x i εconv(A i ) and x j εconv(A j ), where conv(A) represents the convex hull of the elements of set A.

CROSS REFERENCE TO RELATED UNITED STATES APPLICATIONS

This application claims priority from “A Convex Hulls SeparationAlgorithm for Ranking”, U.S. Provisional Application No. 60/687,540 ofFung, et al., filed Jun. 3, 2005, the contents of which are incorporatedherein by reference.

TECHNICAL FIELD

This invention is directed to the automatic ranking and classificationof digital data, in particular for identifying features and objects indigital medical images.

DISCUSSION OF THE RELATED ART

Physicians and scientists have long explored the use of artificialintelligence systems in medicine. One area of research has been buildingcomputer-aided diagnosis (CAD) systems for the automated interpretationand analysis of medical images, in order to classify and identify normaland abnormal features in a dataset. For example, such systems could beused for classifying and identifying polyps, tumors, and other abnormalgrowths from normal tissue in a digital medical image of a patient.

Many machine learning applications useful for the automatedinterpretation of medical images depend on accurately ordering theelements of a set based on the known ordering of only some of itselements. A known ordering of this type can arise from a physician'sranking of objects in an image as being abnormal, for example, a polypor a tumor. In this type of situation, the physician assigns a ranking,for example a number between 1 and 10, of an object being abnormal, witha 1 indicating that the object is not abnormal, and a 10 indicating thatthe object is almost certainly abnormal. In the literature, variants ofthis problem have been referred to as ordinal regression, ranking, andlearning of preference relations. Formally, the goal is to find afunction ƒ:

^(n)→

such that, for a set of test samples {x_(k)ε

^(n)}, the output of the function ƒ(x_(k)) can be sorted to obtain aranking. In order to learn such a function there is provided trainingdata, A, containing S sets (or classes) of training samples:${A = {\bigcup_{j = 1}^{S}\left( {A^{j} = \left\{ x_{i}^{j} \right\}_{i = 1}^{m_{j}}} \right)}},$where the j-th set A^(j) contains m_(j) samples, so that there is atotal of $m = {\sum\limits_{j = 1}^{S}m_{j}}$samples in A. Further, there is also provided a directed order graphG=(V, E) each of whose vertices corresponds to a class A^(j), and theexistence of a directed edge from A^(P) to A^(Q), denoted E_(PQ),signifies that all training samples x_(i)εA^(P) should be ranked higherthan any sample x_(j)εA^(Q): i.e. ∀(x_(i)εA^(P), x_(j)εA^(Q)),ƒ(x_(i))≦ƒ(x_(j)).

In general the number of constraints on the ranking function grows asO(m²) so that naive solutions are computationally infeasible even formoderate sized training sets with a few thousand samples. One approachused a non-parametric Bayesian model for ordinal regression based onGaussian processes. Inference in this model is computationallyintractable; thus it was necessary to employ approximate inferencemethods (EP and Laplace approximations), although GP's are notrestricted to these.

The problem of learning rankings was first treated as a classificationproblem on pairs of objects and subsequently used on a web page rankingtask. The major advantage of this approach is that it considers a moreexplicit notion of ordering; however, the naive optimization strategyproposed there suffers from the O(m²) growth in the number ofconstraints previously mentioned. This computational burden rendersthese methods impractical even for medium sized datasets with a fewthousand samples. In other related work, boosting methods have beenproposed for learning preferences, and a combinatorial structure calledthe ranking poset was used for conditional modeling of partially rankeddata in the context of combining ranked sets of web pages produced byvarious web page search engines. A different type of approach uses aneural network to rank the inputs.

SUMMARY OF THE INVENTION

Exemplary embodiments of the invention as described herein generallyinclude methods and systems for learning ranking functions from orderconstraints between sets or classes of training samples. In particular,constraints on the ranking function are modified to:∀(x_(i)εconv(A^(P)), x_(j)εconv(A^(Q))), ƒ(x_(i))≦ƒ(x_(j)), whereconv(A^(j)) denotes the set of all points in the convex hull of A^(j).This leads to: (1) a family of approximations to the original problem;and (2) considerably more efficient solutions that still enforce all ofthe original inter-group order constraints. Notice that this formulationsubsumes the standard ranking problem as a special case when each setA^(j) is reduced to a singleton, and the order graph is equal to a fullgraph. A ranking algorithm according to an embodiment of the inventionpenalizes wrong orderings of pairs of training instances in order tolearn ranking functions, but in addition, utilizes the notion of astructured class order graph. Nevertheless, using a formulation based onconstraints over convex hulls of the training classes avoids theprohibitive computational complexity of the previous algorithms forranking.

FIGS. 1(a)-(f) illustrate the types of graphs that can be incorporatedinto a ranking method according to an embodiment of the invention, inparticular, various instances consistent with the training set {v, w, x,y, z} satisfying v>w>x>y>z. Each problem instance is defined by an ordergraph. FIGS. 1(a)-(d) depict a succession of order graphs with anincreasing number of constraints. FIGS. 1(e)-(f) illustrate two ordergraphs defining the same partial ordering but different probleminstances. As illustrated in FIG. 1, a ranking formulation according toan embodiment of the invention does not require a total ordering of thesets of training samples A^(j) in that it allows any order graph G to beincorporated into the problem.

Ranking algorithms according to embodiments of the invention can be usedfor maximizing the generalized Wilcoxon Mann Whitney statistic thataccounts for the partial ordering of the classes. Special cases includemaximizing the area under the receiver-operating-characteristic (ROC)curve for binary classification and its generalization for ordinalregression. Experiments on public benchmarks indicate that: (1) analgorithm according to an embodiment of the invention is at least asaccurate as the current state-of-the-art; and that (2) computationally,it is several orders of magnitude faster and, unlike current methods,can easily handle large datasets with over 20,000 samples.

According to an aspect of the invention, there is provided a method forfinding a ranking function ƒ that classifies feature points in ann-dimensional space, the method including providing a plurality offeature points x_(k) in an n-dimensional space R^(n), providing trainingdata A comprising a plurality of sets of training samples A^(j) wherein${A = {\bigcup_{j = 1}^{S}\left( {A^{j} = \left\{ x_{i}^{j} \right\}_{i = 1}^{m_{j}}} \right)}},$wherein S is a number of sets and a j^(th) set A^(j) includes m_(j)samples x_(i) ^(j), providing an ordering E={(P,Q)|A^(P)

A^(Q)} of at least some of said training data sets wherein all trainingsamples x_(i)εA^(P) are ranked higher than any sample x_(j)εA^(Q),solving a mathematical optimization program to determine said rankingfunction ƒ that classifies said feature points x into said plurality ofsets A, wherein for any two sets A^(i), A^(j), wherein A^(i)

A^(j), the ranking function ƒ satisfies inequality constraintsƒ(x_(i))≦ƒ(x_(j)) for all x_(i)εconv(A^(i)) and x_(j)εconv(A^(j)),wherein conv(A) represents the convex hull of the elements of set A.

According to a further aspect of the invention, the ranking function isa linear function of the feature points x of the form w′x, wherein w isan n-dimensional vector.

According to a further aspect of the invention, the mathematicaloptimization program includes slack variables y greater or equal to zerofor non-separable sets wherein not all inequality constraints can besatisfied, wherein said slack variables are a measure of the extent towhich constraints are violated in said mathematical program.

According to a further aspect of the invention, the method comprises oneslack variable y^(i) for each of said training samples x_(i), whereinany training sample point inside the convex hull of any set isassociated with a slack variable equal to a convex combination of y^(i)with coefficients λ.

According to a further aspect of the invention, the mathematical programis of form${\min\limits_{\{{w,y^{i},{y^{ij}❘{{({i,j})} \in E}}}\}}{v{y}^{2}}} + {\frac{1}{2}w^{\prime}w}$such that the equation set Q_(ij) is satisfied ∀(i,j)εE, wherein w is ann-dimensional vector, v is real number controlling the trade off betweenthe two terms, equation set Q_(ij) is ${Q_{ij} \equiv \begin{Bmatrix}\begin{matrix}\begin{matrix}{{\gamma^{ij} + {{K\left( {A^{i},A^{\prime}} \right)}v} + y^{i}} \geq 0} \\{{{\hat{\gamma}}^{ij} - {{K\left( {A^{j},A^{\prime}} \right)}v} + y^{j}} \geq 0}\end{matrix} \\{{\gamma^{ij} + {\hat{\gamma}}^{ij}} \leq {- 1}}\end{matrix} \\{y^{i},{y^{j} \geq 0}}\end{Bmatrix}},$wherein γ^(ij) and {circumflex over (γ)}^(ij) are derived by applyingFarka's theorem to inequality conditions w′A^(j)λ^(j)−w′A^(i)λ^(i)≦−1 onconstraints λ^(j), λ^(i), respectively, wherein0 ≤ λ^(i, j) ≤ 1, ∑λ^(i, j) = 1,and K is an arbitrary kernel function.

According to a further aspect of the invention, the linear inequalityconstraints are equalities represented by ${Q_{ij} = \begin{Bmatrix}\begin{matrix}{{{\gamma^{ij} + {{K\left( {A^{i},A^{\prime}} \right)}v} + y^{i}} = 0},} \\{{{{\hat{\gamma}}^{ij} - {{K\left( {A^{j},A^{\prime}} \right)}v} + y^{j}} = 0},}\end{matrix} \\{{\gamma^{ij} + {\hat{\gamma}}^{ij}} = {- 1.}}\end{Bmatrix}},$wherein vε

is a weighting of said slack terms, γ^(ij) and {circumflex over(γ)}^(ij) are derived by applying Farka's theorem to equality conditionsw′A^(j)λ^(j)−w′A^(i)λ^(i)=−1 on constraints λ^(j), λ^(i), respectively,wherein 0≦λ^(i,j)≦1, ∑λ^(i, j) = 1,and K is an arbitrary kernel function, and wherein said mathematicalprogram is of form${\min\limits_{\{{v,{\gamma^{ij}❘{{({i,j})} \in E}}}\}}{\frac{1}{2}{\sum\limits_{{({i,j})} \in E}\left\lbrack {v\begin{matrix}\begin{pmatrix}{{{{- \gamma^{ij}} - {{K\left( {A^{i},A^{\prime}} \right)}v}}}_{2}^{2} +} \\{{{{\hat{\gamma}}^{ij} + {{K\left( {A^{j},A^{\prime}} \right)}v}}}_{2}^{2} +}\end{pmatrix} \\{\mu{{{\hat{\gamma}}^{ij} + \gamma^{ij} + 1}}_{2}^{2}}\end{matrix}} \right\rbrack}}} + {v}_{2}^{2}$wherien με

is a weighting of the equality constraints.

According to a further aspect of the invention, the method comprisessolving said mathematical program by means of least squares.

According to a further aspect of the invention, μ is approximately one.

According to a further aspect of the invention, the number of sets istwo, represented by A⁺ and A⁻, wherein A⁻

A⁺, and wherein the ranking function satisfies the constraints${{{w^{\prime}A^{\prime -}\lambda^{-}} - {w^{\prime}A^{\prime +}\lambda^{+}}} \leq {- 1}},{{for}\quad{all}\quad\left( {\lambda^{+},\lambda^{-}} \right)\quad{such}\quad{that}\quad\begin{Bmatrix}{{0 \leq \lambda^{+} \leq 1},{{\sum\lambda^{+}} = 1}} \\{{0 \leq \lambda^{-} \leq 1},{{\sum\lambda^{-}} = 1}}\end{Bmatrix}},$wherein w is a vector in

^(n).

According to a further aspect of the invention, A⁺ and A⁻ arenon-separable, and wherein the ranking function satisfies

w′A′⁻λ⁻−w′A′⁺λ⁺≦−1+(λ⁻y⁻+λ⁺y⁺), wherein y⁺, y⁻ are slack variablesgreater than or equal to zero.

According to a further aspect of the invention, the feature pointsrepresent tissue sample regions.

According to a further aspect of the invention, the method comprisesusing said ranking to determine a probability of said tissue samplebeing diseased.

According to a further aspect of the invention, the method comprisesusing said ranking to determine a malignancy of diseased tissue sampleregions.

According to a further aspect of the invention, the tissue sampleregions are derived from a plurality of patients, and further comprisingusing said ranking to sort said plurality of patients according to apredetermined criteria.

According to a further aspect of the invention, the ordering of at leastsome of said training data sets is provided by a physician.

According to a further aspect of the invention, the training samples areassigned to sets based on the results of a diagnostic test.

According to a further aspect of the invention, the training samples areassigned to sets by a physician.

According to a further aspect of the invention, the feature points arederived from a patient's electronic medical record.

According to another aspect of the invention, there is provided aprogram storage device readable by a computer, tangibly embodying aprogram of instructions executable by the computer to perform the methodsteps for finding a ranking function ƒ that classifies feature points inan n-dimensional space.

BRIEF DESCRIPTION OF THE DRAWINGS

FIGS. 1(a)-(f) illustrate the types of graphs that can be incorporatedinto a ranking method according to an embodiment of the invention.

FIG. 2 depicts an exemplary non-separable binary problem, according toan embodiment of the invention.

FIG. 3 displays a list of nine publicly available datasets upon which aranking method according to an embodiment of the invention was tested.

FIGS. 4(a)-(b) are graphs of the results of comparisons of currentranking algorithms and a ranking method according to an embodiment ofthe invention.

FIGS. 5(a)-(b) are graphs of summary results of an experimentalevaluation for a least-squares formulation of a ranking method accordingto an embodiment of the invention.

FIG. 6 is a flow chart of a ranking method according to an embodiment ofthe invention.

FIG. 7 is a block diagram of an exemplary computer system forimplementing a ranking method according to an embodiment of theinvention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

Exemplary embodiments of the invention as described herein generallyinclude systems and methods for learning ranking functions from orderconstraints between sets or classes of training samples. Accordingly,while the invention is susceptible to various modifications andalternative forms, specific embodiments thereof are shown by way ofexample in the drawings and will herein be described in detail. Itshould be understood, however, that there is no intent to limit theinvention to the particular forms disclosed, but on the contrary, theinvention is to cover all modifications, equivalents, and alternativesfalling within the spirit and scope of the invention.

The following notation will be used herein below. Vectors will beassumed to be column vectors unless transposed to a row vector by aprime superscript ′. For a vector x in the n-dimensional real space

^(n), the cardinality of a set A will be denoted by |A|. The scalar(inner) product of two vectors x and y in the n-dimensional real space

^(n) will be denoted by x′y and the 2-norm of x will be denoted by ∥x∥.For a matrix Aε

^(m×n), A_(i)ε

^(n) denotes a row vector formed by the elements of the i-th row of A.Similarly A_(j)ε

^(n) denotes a column vector formed by the elements of the j-th columnof A. A column vector of ones of arbitrary dimension will be denoted bye. For Aε

^(m×n) and Bε

^(n×k), the kernel K(A,B) maps

^(m×n)×

^(n×k) into

^(m×k). In particular, if column vectors in

^(n), then K(x′,y) is a real number, K(x′,A′) is a row vector in

^(n), and K(A,A′) is an m×m matrix. The identity matrix of arbitrarydimension will be denoted by I.

A distinction is usually made between classification and ordinalregression methods on one hand, and ranking on the other. In particular,the loss functions used for classification and ordinal regressionevaluate whether each test sample is correctly classified: in otherwords, the loss functions that are used to evaluate these algorithms,such as the 0-1 loss for binary classification, are computed for everysample individually, and then averaged over the training or test set.

By contrast, bipartite ranking solutions are evaluated using theWilcoxon-Mann-Whitney (WMW) statistic, which measures the (sampleaveraged) probability that any pair of samples is ordered correctly.Intuitively, the WMW statistic can be interpreted as the area under theROC curve. According to an embodiment of the invention, a generalizationof the WMW statistic is defined that accounts for class-ordering:${{WMW}\left( {f,A} \right)} = {\sum\limits_{E_{ij}}{\frac{\sum\limits_{k = 1}^{m_{i}}{\sum\limits_{l = 1}^{m_{j}}1_{({{f{(x_{i}^{k})}} < {f{(x_{l}^{j})}}})}}}{\sum\limits_{k = 1}^{m_{i}}{\sum\limits_{l = 1}^{m_{j}}1}}.}}$Hence, if a sample is individually misclassified because it falls on thewrong side of the decision boundary between classes, it incurs a penaltyin ordinal regression, whereas, in ranking, it may be possible that itis still correctly ordered with respect to every other test sample, andthus it will incur no penalty in the WMW statistic.Convex Hull Formulation

A ranking method according to an embodiment of the invention learns aranking function ƒ:

^(n)→

given known ranking relationships between some training instancesA^(i),A^(j)⊂A (or A⁺ and A⁻ in the two class, binary case). Let theranking relationships be specified by the set E={(i, j)|A^(i)

A^(j)}. For ease of notation, the pairs (i, j) in the set E will bedenoted E_(ij). Consider the linearly separable binary ranking casewhich is equivalent to classifying m points in the n-dimensional realspace

^(n), represented by the m×n matrix A, according to membership of eachpoint x=A_(i) in the class A⁺ or A⁻ as specified by a given vector oflabels d. For binary classifiers, this is equivalent to a linear rankingfunction ƒ_(w)(x)=w′x that satisfies the following constraints:∀(x ⁺ εA ⁺ ,x ⁻ εA ⁻),ƒ(x ⁻)≦ƒ(x ⁺)

ƒ(x ⁻)−ƒ(x ⁺)=w′x ⁻ −w′x ⁻ −w′x ⁺≦−1≦0.  (1)

The number of constraints in equation (1) grows as O(m⁺m⁻), which isroughly quadratic in the number of training samples (unless there is asevere class imbalance). While additional insights permit this to beovercome in the separable case, in the non-separable case, the quadraticgrowth in the number of constraints poses computational burdens on anyoptimization algorithm, and direct optimization with these constraintsis unfeasible even for moderate sized problems. This computationalproblem can be addressed based on three insights that are explainedbelow.

First, notice that, by negation, the feasibility constraints in (1) canalso be defined as:∀(x ⁺ εA ⁺ ,x ⁻ εA ⁻),w′x ⁻ −w′x ⁺≦−1

(x ⁺ εA ⁺ ,x ⁻ εA ⁻),w′x ⁻ −w′x ⁺>−1.In other words, a solution w is feasible iff there exist no pair ofsamples from the two classes such that ƒ_(w)(·) orders them incorrectly.

Second, the constraints in (1) can be made more stringent: instead ofrequiring that equation (1) be satisfied for each possible pair(x⁺εA⁺,x⁻εA⁻) in the training set, require (1) to be satisfied for eachpair (x⁺εconv(A⁺),x⁻εconv(A⁻)), where conv(A^(i)) denotes the convexhull of the set A^(i). Thus, the constraints become: $\begin{matrix}{{\forall{\left( {\lambda^{+},\lambda^{-}} \right)\quad{such}\quad{that}\begin{Bmatrix}{{0 \leq \lambda^{+} \leq 1},{{\sum\lambda^{+}} = 1}} \\{{0 \leq \lambda^{-} \leq 1},{{\sum\lambda^{-}} = 1}}\end{Bmatrix}}},{{{w^{\prime}A^{- \prime}\lambda^{\prime}} - {w^{\prime}A^{+ \prime}\lambda^{\prime}}} \leq {- 1.}}} & (2)\end{matrix}$Next, notice that all the linear inequality and equality constraints on(λ⁺, λ⁻) can be grouped together as Bλ[b, where,${\lambda = \begin{bmatrix}\lambda^{-} \\\lambda^{+}\end{bmatrix}_{m \times 1}},{b^{+} = \begin{bmatrix}0_{m^{+} \times 1}^{+} \\1 \\{- 1}\end{bmatrix}_{{({m^{+} + 2})} \times 1}},{b^{-} = \begin{bmatrix}0_{m^{-} \times 1}^{-} \\1 \\{- 1}\end{bmatrix}_{{({m^{-} + 2})} \times 1}},{b = \begin{bmatrix}b^{+} \\b^{-}\end{bmatrix}},{B^{-} = \begin{bmatrix}{{- I_{m^{-}}}0} \\{{\mathbb{e}}^{\prime}\quad 0} \\{{- {\mathbb{e}}^{\prime}}\quad 0}\end{bmatrix}_{{({m^{-} + 2})} \times m}},{B^{+} = \begin{bmatrix}{0 - I_{m^{+}}} \\{0\quad{\mathbb{e}}^{\prime}} \\{0 - {\mathbb{e}}^{\prime}}\end{bmatrix}_{{({m^{+} + 2})} \times m}},{B = {\begin{bmatrix}B^{-} \\B^{+}\end{bmatrix}.}}$Thus, the constraints on w can be written in the following equivalentforms:∀λs.t. Bλ≦b, w′A ⁻ ^(′) λ⁻ −w′A ⁺ ^(′) λ⁺≦−1,  (3a)

λs.t. B′λ≦b, w′A ⁻ ^(′) λ⁻ −w′A ⁺ ^(′) λ⁺>−1,  (3b)

∃u s.t. B′u−w′[A ⁻ ^(′) −A ⁺ ^(′) ]=0, b′u≦−1,u≧0,  (3c)where the second equivalent form of the constraints was obtained bynegation (as before), and the third equivalent form results from thethird insight: the application of Farka's theorem of alternatives.Farka's theorem states that for an x≧0 where Ax=b, there exists a z suchthat z′A≧0 and z′b<0. The use of Farka's theorem allows one toincorporate logical conditions into a set of equations. In the situationabove, the logical condition is of the form: IF Bλ≦b THEN w′A⁻ ^(′)λ⁻−w′A⁺ ^(′) λ⁺≦−1, while (3c) is the resulting equations. Theapplication of Farka's theorem is referred to herein as a Farkatransformation. Note that the resulting equations can be inequalities.The resulting linear system of m equalities and m+5 inequalities inm+n+4 variables can be used while minimizing any regularizer (such as∥w∥²) to obtain the linear ranking function that satisfies equation (1).Note that this formulation avoids the O(m²) scaling in constraints.Binary Non-Separable Case

In a binary non-separable case, conv(A⁺)∩conv(A⁻)≠∅, so the requirementsshould be relaxed by introducing slack variables. One slack variabley_(i)≧0 can be introduced for each training sample x_(i), and the slackfor any point inside the convex hull conv(A^(j)) can be expressed as aconvex combination of the y_(i)'s. This implies that if only a subset oftraining samples has non-zero slacks y_(i)>0, (i.e. they are possiblymisclassified), then the slacks of any points inside the convex hullalso only depend on those y_(i). Thus, the constraints now become:∀λs.t. Bλ≦b, w′A ⁻ ^(′) λ⁻ −w′A ⁺ ^(′) λ⁺≦−1+(λ⁻ y ⁻+λ⁺ y ⁺), y ⁺ ,y⁻≧0.Applying Farka's theorem of alternatives, one finds that equation (2) isequivalent to: $\begin{matrix}{{{\exists{{u\quad{s.t.\quad B^{\prime}}u} - \begin{bmatrix}{A^{-}w} \\{{- A^{+}}w}\end{bmatrix} + \begin{bmatrix}y^{-} \\y^{+}\end{bmatrix}}} = 0},{{b^{\prime}u} \leq {- 1}},{u \geq 0}} & (3)\end{matrix}$Replacing B from the above definitions and definingu′=[(u⁻)^(′)(u⁺)^(′)]≧0, the following constraints are obtained:(B ^(i))^(′) u ⁺ +A ⁺ w+y ⁺=0,(B ^(i))^(′) u ⁻ −A ⁻ w+y ⁻=0,b ⁺ u ⁺ +b ⁻ u ⁻≦−1,u≧0.

FIG. 2 depicts an exemplary non-separable binary problem, according toan embodiment of the invention. Referring to the figure, pointsbelonging to the A+ and A⁻ sets are represented by circles andtriangles, respectively. Two elements x_(i) and x_(j) of the set A⁻ arenot correctly ordered and hence generate positive values of thecorresponding slack variables y_(i) and y_(j). On the other hand,element x_(k), represented by a hollow triangle, is in the convex hullof the set A⁻ and hence the corresponding y_(k) error can be written asa convex combination y_(k)=λ_(i) ^(k)y_(i)+λ_(j) ^(k)y_(j) of the twononzero errors corresponding to points of A⁻.

The General Ranking Case

The three insights presented above can be applied to any arbitrarydirected order graph G=(S, E), each of whose vertices corresponds to aclass A^(j) and where the existence of a directed edge E_(ij) means thatall training samples x_(i)εA^(i) should be ranked higher than any samplex_(j)εA^(j):ƒ(x ^(j))≦ƒ(x ^(i))

ƒ(x ^(j))−ƒ(x ^(i))=w′x ^(j) −w′x ^(i)≦−1≦0.Analogously, the following set of equations that enforces the orderingbetween sets A^(i) and A^(j) can be obtained:(B ^(i))^(′) u ^(ij) +A ^(i) w+y ^(i)=0,(B ^(j))^(′) û ^(ij) −A ^(j) w+y ^(j)=0,b ^(i) u ^(ij) +b ^(j) û ^(ij)≦−1,u ^(ij) ,û ^(ij)≦0.  (4)Furthermore, using the definitions of B^(i), B^(j), b^(i), b^(j) and thefact that u^(ij),û^(ij)≧0, the constraints of equations (4) can berewritten in the following way:eγ ^(ij) +A ^(i) w+y ^(i)≧0,e{circumflex over (γ)} ^(ij) −A ^(j) w+y ^(j)≧0,γ^(ij)+{circumflex over (γ)}^(ij)≦−1,y ^(i) ,y ^(j)≧0.  (5)where γ^(ij)=b^(i)u^(ij) and {circumflex over (γ)}^(ij)=b^(j)û^(ij).Note that enforcing the constraints defined above implies a desiredordering:A ^(i) w+y ^(i) ≧−eγ ^(ij) ≧e{circumflex over (γ)} ^(ij)+1≧e{circumflexover (γ)} ^(ij) ≧A ^(j) w−y ^(j).To obtain a more general nonlinear algorithm, equations (4) can be“kernelized” by making a transformation of the variable w as: w=A′v,where v can be interpreted as an arbitrary variable in

^(m). Employing this transformation, equations (4) become:eγ ^(ij) +A ^(i) A′v+y ^(i)≧0,e{circumflex over (γ)} ^(ij) −A ^(j) A′v+y ^(j)≧0,γ^(ij)+{circumflex over (γ)}^(ij)≦−1,y ^(i) ,y ^(j)≧0If the linear kernels A^(i)A′ and A^(j)A′ are replaced by more generalkernels K(A^(i),A′) and K(A^(j), A′), a “kernelized” version ofequations (4) is obtained: $\begin{matrix}{Q_{ij} = {\begin{Bmatrix}{{{e\quad\gamma^{ij}} + {{K\left( {A^{i},A^{\prime}} \right)}v} + y^{i}} \geq 0} \\{{{e\quad{\hat{\gamma}}^{ij}} - {{K\left( {A^{j},A^{\prime}} \right)}v} + y^{j}} \geq 0} \\{{\gamma^{ij} + {\hat{\gamma}}^{ij}} \leq {- 1}} \\{y^{i},{y^{j} \geq 0}}\end{Bmatrix}.}} & (5)\end{matrix}$Given a graph G=(S, E) representing the ordering of the training dataand using equations (5), a general mathematical programming formulationfor ranking can be presented: $\begin{matrix}\begin{matrix}{{\min\limits_{\{{v,y^{i},{\gamma^{ij}❘{{({i,j})} \in E}}}\}}{v\quad{ɛ(y)}}} + {R(v)}} & {s.t.\quad Q_{ij}} & {\forall{\left( {i,j} \right) \in {E.}}}\end{matrix} & (6)\end{matrix}$In equation (6), ε is a loss function for the slack variables y^(i) andR(v) represents a regularizer on the normal to the hyperplane v. For anarbitrary kernel K(x,x′), the number of variables in formulation (6) is2m+2|E|, and the number of linear equations (excluding thenon-negativity constraints) is m|E|+|E|=|E|(m+1). For a linear kernel,K(x,x′)=xx′, the number of variables of formulation (6) becomesm+n+2|E|, and the number of linear equations remains the same. Whenusing a linear kernel and using ε(x)=R(x)=∥x∥², the optimizationformulation (6) becomes a linearly constrained quadratic optimizationsystem for which a unique solution exists due to the convexity of theobjective function: $\begin{matrix}{{\min\limits_{\{{w,y^{i},{\gamma^{ij}❘{{({i,j})} \in E}}}\}}{v{y}^{2}}} + {\frac{1}{2}w^{\prime}w}} & {s.t.} & Q_{ij} & {\forall{\left( {i,j} \right) \in {E.}}}\end{matrix}$Unlike SVM-like methods for ranking that need O(m²) slack variables, aformulation according to an embodiment of the invention only requiresone slack variable for each example, and only m slack variables areused, giving this formulation a computational advantage over otherranking methods.Least-Squares Formulation

A least squares solution to ranking equations can be formulated byrelaxing the inequalities of (6) in the following way: $\begin{matrix}{Q_{ij} = \begin{Bmatrix}{{{e\quad\gamma^{ij}} + {{K\left( {A^{i},A^{\prime}} \right)}v} + y^{i}} \geq 0} \\{{{e\quad{\hat{\gamma}}^{ij}} - {{K\left( {A^{j},A^{\prime}} \right)}v} + y^{j}} \geq 0} \\{{\gamma^{ij} + {\hat{\gamma}}^{ij}} \leq {- 1}}\end{Bmatrix}} & (7)\end{matrix}$

Given a graph G=(V, E) representing the ordering of the training dataand using the “relaxed” constraints (7), the following unconstrainedstrongly convex quadratic programming formulation can be obtained forthe ranking equations: $\begin{matrix}{{\min\limits_{\{{v,{\gamma^{ij}❘{{({i,j})} \in E}}}\}}{\sum\limits_{{({i,j})} \in E}\left\lbrack {{v\left( {{{{{- e}\quad\gamma^{ij}} - {{K\left( {A^{i},A^{\prime}} \right)}v}}}_{2}^{2} + {{{e\quad{\hat{\gamma}}^{ij}} - {{K\left( {A^{j},A^{\prime}} \right)}v}}}_{2}^{2}} \right)} + {\mu{{{\hat{\gamma}}^{ij} + \gamma^{ij} + 1}}_{2}^{2}}} \right\rbrack}} + {v}_{2}^{2}} & (8)\end{matrix}$where vε

and με

are regularization parameters that are selected by cross-validation onthe training data. However, according to one embodiment of the inventionm, a value of μ=1 works well, wherein in experiments testing theformulation only the v parameter need be tuned. To find the uniqueminimizer of formulation (8), one solves for the gradient of theobjective function to be equal to zero, obtaining the following systemof linear equations:${{{\sum\limits_{{({i,j})} \in E}^{\quad}{v\left\lbrack {{\left( {{{\mathbb{e}}\quad\gamma^{ij}} + {{K\left( {A^{i},A^{\prime}} \right)}v}} \right)^{\prime}{K\left( {A^{i},A^{\prime}} \right)}} + {\left( {{{- {\mathbb{e}}}{\hat{\gamma}}^{ij}} + {{K\left( {A^{j},A^{\prime}} \right)}v}} \right)^{\prime}{K\left( {A^{j},A^{\prime}} \right)}}} \right\rbrack}} + \quad v^{\prime}} = 0},{{{{v\left( {{{\mathbb{e}}\quad\gamma^{ij}} + {{K\left( {A^{i},A^{\prime}} \right)}v}} \right)}e} + {\mu\left( {{\hat{\gamma}}^{ij} + \gamma^{ij} + 1} \right)}} = 0},{\forall{\left( {i,j} \right) \in {E.}}}$When using a linear kernel, e.g. K(x, y)=x′y, w=A′v, the linear systemof equations to solve becomes:${{\sum\limits_{{({i,j})} \in E}^{\quad}{v\left\lbrack {{\left( {{\mathbb{e}\gamma}^{ij} + {A^{i}w}} \right)^{\prime}A^{i}} + {\left( {{- {\mathbb{e}\gamma}^{ij}} + {A^{j}w}} \right)^{\prime}A^{j}}} \right\rbrack}} + w^{\prime}} = 0$v(𝕖γ^(ij) + A^(i)w)^(′)e + μ(γ^(ij) + γ^(ij) + 1) = 0, ∀(i, j) ∈ E.Geometrically, when G is a chain graph, a hyperplane can be found thatfits every class A^(i) in the least square sense while simultaneouslymaximizing the margins between the classes.

According to an embodiment of the invention, the resulting square linearsystem of equations is of the size n+2|(E)|, where n is the number offeatures, usually a relatively small number (in the low hundreds) formost real-life applications. As a result, this least-squares formulationyields another order-of-magnitude improvement in the run-time of aranking algorithm according to an embodiment of the invention.

EXPERIMENTAL EVALUATION

A ranking method according to an embodiment of the invention was testedon a set of nine publicly available datasets shown in the table of FIG.3. These datasets have been frequently used as a benchmark for ordinalregression methods. Here they are used for evaluating rankingperformance. A method according to an embodiment of the invention istested against SVM for ranking and an efficient Gaussian process method(the informative vector machine (IVM)).

Since these datasets were originally designed for testing regression,the continuous target values for each dataset were discretized into fiveequal size bins. These bins were used to define ranking constraints: allthe datapoints with target values falling in the same bin were groupedtogether. Each dataset was divided into 10% for testing and 90% fortraining, in a 10-fold cross validation schedule. Thus, for all of thealgorithms tested, the input was, for each point in the training set:(1) a vector in

^(n), where n is different for each set; and (2) a value from 1 to 5denoting the rank of the group to which it belongs.

Accuracy of these algorithms is defined in terms of the Wilcoxonstatistic for ordinal regression. Since information about the ranking ofthe elements within each group is not used, order constraints within agroup cannot be verified. The total number of order constraints forordinal regression is equal to ${\begin{pmatrix}m \\2\end{pmatrix} - {\sum_{i}\begin{pmatrix}m_{i} \\2\end{pmatrix}}},$where m_(i) is the number of instances in group i.

The results for all methods tested are shown in FIG. 4. A formulationaccording to an embodiment of the invention was tested employing twodifferent order graphs: the full directed acyclic graph and the chaingraph. For each dataset, the accuracy of a method according to anembodiment of the invention is either comparable to or better thancurrent methods, when using a chain order graph. Regarding run-timeperformance, an algorithm according to an embodiment of the inventioncan be up to at least an order of magnitude faster than currentimplementations of state-of-the-art methods.

FIGS. 4(a)-(b) are graphs of the results of comparisons of currentranking algorithms and a ranking method according to an embodiment ofthe invention. FIG. 4(a) displays accuracy results, measured using thegeneralized Wilcoxon statistic, while FIG. 4(b) displays run-timeperformance results. The datasets tested were those listed in the tableof FIG. 3. Along with the mean values in 10 fold cross-validation, theentire range of variation is indicated in the error-bars. Referring toFIG. 4(a), the overall accuracy for all the three methods is comparable.Referring to FIG. 4(b), a method according to an embodiment of theinvention has a lower run time than the other methods, even for the fullgraph case for medium to large size datasets.

Note, however, that the accuracy of a method according to an embodimentof the invention when using a full graph is lower than that for a chaingraph. Since the evaluation of accuracy was performed using the extendedWilcoxon statistic for ordinal regression, which inherently reflects thechain graph in terms of the ordering of the classes, this observation isnot entirely surprising. Nevertheless, it is interesting that enforcingmore order constraints does not necessarily imply higher accuracy. Thismay be due to the role that the slack variables play in bothformulations, since the number of slack variables remains the same whilethe number of constraints increases. Adding more slack variables maypositively affect performance in the full graph, but this comes at acomputational cost.

A least squares approximation according to an embodiment of theinvention was tested for a chain graph using the same publicly availabledatasets, and the same experimental setup as above, in a 10-fold crossvalidation. Results are shown in FIG. 5. On average, the approximatemethod is less accurate as compared to the original one. However,accuracy is still comparable with other current methods. Run-timeperformance is is about an order of magnitude faster than the originalformulation for the chain graph.

FIGS. 5(a)-(b) are graphs of summary results of an experimentalevaluation for a least-squares formulation of a ranking method accordingto an embodiment of the invention. FIG. 5(a) displays accuracy results,measured using the generalized Wilcoxon statistic, while FIG. 5(b)displays run-time performance results. The datasets tested were thoselisted in the table of FIG. 3. The graphs show mean values and entirerange of variation, as indicated by the error-bars, in a 10 fold crossvalidation. The results are for the least squares approximation vs. thebasic ranking formulation, according to embodiments of the invention,using the two types of order graphs tested in the previous experiment.Referring to FIG. 5(a), overall accuracy of the least squares method iscomparable with that of competing methods, as depicted in FIGS. 4, andslightly worse than that of a basic ranking formulation according to anembodiment of the invention. Referring to FIG. 5(b), in terms ofrun-time, a least squares formulation is several orders of magnitudefaster than the fastest method tested, including a basic formulationaccording to an embodiment of the invention.

A flow chart of a ranking method according to an embodiment of theinvention is depicted in FIG. 6. Referring now to the figure, aplurality of feature points x_(k) in an n-dimensional space R^(n) isprovided in step 61. The feature points can derived from tissue sampleregions in a digital medical image. Alternatively, the feature pointscould be obtained from a patient's electronic medical record, or couldrepresent individual patients for the purpose of sorting patients by aseverity of disease. A training data A that includes a plurality of setsof training samples A^(j) is provided at step 62, where${A = {\bigcup_{j = 1}^{S}\left( {A^{j} = \left\{ x_{i}^{j} \right\}_{i = 1}^{m_{j}}} \right)}},$where S is the number of sets and the j^(th) set A^(j) includes mysamples x_(i) ^(j). At step 63, an ordering E={(P,Q)|A^(P)

A^(Q)} of at least some of the training data sets is provided, whereE_(PQ) signifies that all training samples x_(i)εA^(P) are ranked higherthan any sample x_(j)εA_(Q). A mathematical optimization program issolved at step 64 to determine the ranking function ƒ that classifiesthe feature points x into the sets A, where for any two sets A^(i),A^(j), one has A^(i)

A^(j). The ranking function ƒ satisfies inequality constraintsƒ(x_(i))≦ƒ(x_(j)) for all x_(i)εconv(A^(i)) and x_(j)εconv(A^(j)), whereconv(A) represents the convex hull of the elements of set A. The rankingcan represent categorizing the probability of or status of a disease,for example, ranking cancer lesions as [1] definitely malignant, [2]likely malignant, [3] not sure, [4] likely benign, [5] definitelybenign, or other disease status categories, or ranking sample regions inorder of the probability of the region being diseased.Hardware Support

It is to be understood that the present invention can be implemented invarious forms of hardware, software, firmware, special purposeprocesses, or a combination thereof. In one embodiment, the presentinvention can be implemented in software as an application programtangible embodied on a computer readable program storage device. Theapplication program can be uploaded to, and executed by, a machinecomprising any suitable architecture.

FIG. 7 is a block diagram of an exemplary computer system forimplementing a ranking method according to an embodiment of theinvention. Referring now to FIG. 7, a computer system 71 forimplementing the present invention can comprise, inter alia, a centralprocessing unit (CPU) 72, a memory 73 and an input/output (I/O)interface 74. The computer system 71 is generally coupled through theI/O interface 74 to a display 75 and various input devices 76 such as amouse and a keyboard. The support circuits can include circuits such ascache, power supplies, clock circuits, and a communication bus. Thememory 73 can include random access memory (RAM), read only memory(ROM), disk drive, tape drive, etc., or a combinations thereof. Thepresent invention can be implemented as a routine 77 that is stored inmemory 73 and executed by the CPU 72 to process the signal from thesignal source 78. As such, the computer system 71 is a general purposecomputer system that becomes a specific purpose computer system whenexecuting the routine 77 of the present invention.

The computer system 71 also includes an operating system and microinstruction code. The various processes and functions described hereincan either be part of the micro instruction code or part of theapplication program (or combination thereof) which is executed via theoperating system. In addition, various other peripheral devices can beconnected to the computer platform such as an additional data storagedevice and a printing device.

It is to be further understood that, because some of the constituentsystem components and method steps depicted in the accompanying figurescan be implemented in software, the actual connections between thesystems components (or the process steps) may differ depending upon themanner in which the present invention is programmed. Given the teachingsof the present invention provided herein, one of ordinary skill in therelated art will be able to contemplate these and similarimplementations or configurations of the present invention.

While the present invention has been described in detail with referenceto a preferred embodiment, those skilled in the art will appreciate thatvarious modifications and substitutions can be made thereto withoutdeparting from the spirit and scope of the invention as set forth in theappended claims.

1. A method for finding a ranking function ƒ that classifies featurepoints in an n-dimensional space, said method comprising the steps of:providing a plurality of feature points x_(k) in an n-dimensional spaceR^(n), said feature points derived from a digital medical image;providing training data A comprising a plurality of sets of trainingsamples A^(j) wherein${A = {\bigcup_{j = 1}^{S}\left( {A^{j} = \left\{ x_{i}^{j} \right\}_{i = 1}^{m_{j}}} \right)}},$wherein S is a number of sets and a j^(th) set A^(j) includes m_(j)samples x_(i) ^(j); providing an ordering E={(P,Q)|A^(P)

A^(Q)} of at least some of said training data sets wherein all trainingsamples x_(i)εA^(P) are ranked higher than any sample x_(j)εA^(Q);solving a mathematical optimization program to determine said rankingfunction ƒ that classifies said feature points x into said plurality ofsets A, wherein for any two sets A^(i), A^(j), wherein A^(i)

A^(j), the ranking function ƒ satisfies inequality constraintsƒ(x_(i))≦ƒ(x_(j)) for all x_(i)εconv(A^(i)) and x_(i)εconv(A^(j)),wherein conv(A) represents the convex hull of the elements of set A. 2.The method of claim 1, wherein the ranking function is a linear functionof the feature points x of the form w′x, wherein w is an n-dimensionalvector.
 3. The method of claim 2, wherein said mathematical optimizationprogram includes slack variables y greater or equal to zero fornon-separable sets wherein not all inequality constraints can besatisfied, wherein said slack variables are a measure of the extent towhich constraints are violated in said mathematical program.
 4. Themethod of claim 3, comprising one slack variable y^(i) for each of saidtraining samples x_(i), wherein any training sample point inside theconvex hull of any set is associated with a slack variable equal to aconvex combination of y^(i) with coefficients λ.
 5. The method of claim4, wherein said mathematical program is of form${\min\limits_{\{{w,y^{i},{\gamma^{ij}❘{{({i,j})} \in E}}}\}}{v{y}^{2}}} + {\frac{1}{2}w^{\prime}w}$such that the equation set Q_(ij) is satisfied ∀(i, j)εE, wherein w isan n-dimensional vector, v is real number controlling the trade offbetween the two terms, equation set Q_(ij) is${Q_{ij} \equiv \begin{Bmatrix}{{\gamma^{ij} + {{K\left( {A^{i},A^{\prime}} \right)}v} + y^{i}} \geq 0} \\{{{\hat{\gamma}}^{ij} - {{K\left( {A^{j},A^{\prime}} \right)}v} + y^{j}} \geq 0} \\{{\gamma^{ij} + {\hat{\gamma}}^{ij}} \leq {- 1}} \\{y^{i},{y^{j} \geq 0}}\end{Bmatrix}},$ wherein γ^(ij) and {circumflex over (γ)}^(ij) arederived by applying Farka's theorem to inequality conditionsw′A^(j)λ^(i)−w′A^(i)λ^(i)≦−1 on constraints λ^(j), λ^(i), respectively,wherein 0≦λ^(i,j)≦1, ∑λ^(i, j) = 1, and K is an arbitrary kernelfunction.
 6. The method of claim 4, wherein said linear inequalityconstraints are equalities represented by ${Q_{ij} = \begin{Bmatrix}{{{\gamma^{ij} + {{K\left( {A^{i},A^{\prime}} \right)}v} + y^{i}} = 0},} \\{{{{\hat{\gamma}}^{ij} - {{K\left( {A^{j},A^{\prime}} \right)}v} + y^{j}} = 0},} \\{{\gamma^{ij} + {\hat{\gamma}}^{ij}} = {- 1.}}\end{Bmatrix}},$ wherein vε

is a weighting of said slack terms, γ^(ij) and {circumflex over(γ)}^(ij) are derived by applying Farka's theorem to equality conditionsw′A^(j)λ^(j)−w′A^(i)λ^(i)=−1 on constraints λ^(j), λ^(i), respectively,wherein 0≦λ^(i,j)≦1, ∑λ^(i, j) = 1, and K is an arbitrary kernelfunction, and wherein said mathematical program is of form${\min\limits_{\{{v,{\gamma^{\mathbb{i}j}❘{{({i,j})} \in E}}}\}}{\frac{1}{2}{\sum\limits_{{({i,j})} \in E}^{\quad}\left\lbrack {{v\left( {{{{- \gamma^{\mathbb{i}j}} - {{K\left( {A^{i},A^{\prime}} \right)}v}}}_{2}^{2} + {{{\hat{\gamma}}^{\mathbb{i}j} + {{K\left( {A^{j},A^{\prime}} \right)}v}}}_{2}^{2}} \right)} + {\mu{{{\hat{\gamma}}^{\mathbb{i}j} + \gamma^{\mathbb{i}j} + 1}}_{2}^{2}}} \right\rbrack}}} + {v}_{2}^{2}$wherien με

is a weighting of the equality constraints.
 7. The method of claim 6,further comprising solving said mathematical program by means of leastsquares.
 8. The method of claim 6, wherein μ is approximately one. 9.The method of claim 1, wherein the number of sets is two, represented byA⁺ and A⁻, wherein A⁻

A⁺, and wherein the ranking function satisfies the constraints${{{w^{\prime}A^{\prime -}\lambda^{-}} - {w^{\prime}A^{\prime +}\lambda^{+}}} \leq {- 1}},{{for}\quad{all}\quad\left( {\lambda^{+},\lambda^{-}} \right)\quad{such}\quad{that}\quad\begin{Bmatrix}{{0 \leq \lambda^{+} \leq 1},{{\sum\lambda^{+}} = 1}} \\{{0 \leq \lambda^{-} \leq 1},{{\sum\lambda^{-}} = 1}}\end{Bmatrix}},$ wherein w is a vector in

^(n).
 10. The method of claim 9, wherein A⁺ and A⁻ are non-separable,and wherein the ranking function satisfiesw′A′⁻λ⁻−w′A′⁺λ⁺≦−1+(λ⁻y⁻+λ⁻+λ⁺y⁺), wherein y⁺, y⁻ are slack variablesgreater than or equal to zero.
 11. The method of claim 1, wherein saidfeature points represent tissue sample regions.
 12. The method of claim11, further comprising using said ranking to determine a probability ofsaid tissue sample being diseased.
 13. The method of claim 11, furthercomprising using said ranking to determine a malignancy of diseasedtissue sample regions.
 14. The method of claim 11, wherein said tissuesample regions are derived from a plurality of patients, and furthercomprising using said ranking to sort said plurality of patientsaccording to a predetermined criteria.
 15. The method of claim 1,wherein said ordering of at least some of said training data sets isprovided by a physician.
 16. The method of claim 1, wherein saidtraining samples are assigned to sets based on the results of adiagnostic test.
 17. The method of claim 1, wherein said trainingsamples are assigned to sets by a physician.
 18. The method of claim 1,wherein said feature points are derived from a patient's electronicmedical record.
 19. A program storage device readable by a computer,tangibly embodying a program of instructions executable by the computerto perform the method steps for finding a ranking function ƒ thatclassifies feature points in an n-dimensional space, said methodcomprising the steps of: providing a plurality of feature points x_(k)in an n-dimensional space R^(n), said feature points derived from adigital medical image; providing training data A comprising a pluralityof sets of training samples A^(j) wherein${A = {\bigcup_{j = 1}^{S}\left( {A^{j} = \left\{ x_{i}^{j} \right\}_{i = 1}^{m_{j}}} \right)}},$wherein S is a number of sets and a j^(th) set A^(j) includes m_(j)samples x_(i) ^(j); providing an ordering E={(P,Q)|A^(P)

A^(Q)} of at least some of said training data sets wherein all trainingsamples x_(i)εA^(P) are ranked higher than any sample x_(j)εA^(Q);solving a mathematical optimization program to determine said rankingfunction ƒ that classifies said feature points x into said plurality ofsets A, wherein for any two sets A^(i), A^(j), wherein A^(i)

A^(j), the ranking function ƒ satisfies inequality constraintsƒ(x₁)≦ƒ(x_(j)) for all x_(i)εconv(A^(i)) and x_(j)εconv(A^(j)), whereinconv(A) represents the convex hull of the elements of set A.
 20. Thecomputer readable program storage device of claim 19, wherein theranking function is a linear function of the feature points x of theform w′x, wherein w is an n-dimensional vector.
 21. The computerreadable program storage device of claim 20, wherein said mathematicaloptimization program includes slack variables y greater or equal to zerofor non-separable sets wherein not all inequality constraints can besatisfied, wherein said slack variables are a measure of the extent towhich constraints are violated in said mathematical program.
 22. Thecomputer readable program storage device of claim 21, comprising oneslack variable y^(i) for each of said training samples x_(i), whereinany training sample point inside the convex hull of any set isassociated with a slack variable equal to a convex combination of y^(i)with coefficients λ.
 23. The computer readable program storage device ofclaim 22, wherein said mathematical program is of form${\min\limits_{\{{w,y^{i},{\gamma^{i\quad j}❘{{({i,j})} \in E}}}\}}{v{y}^{2}}} + {\frac{1}{2}w^{\prime}w}$such that the equation set Q_(ij) is satisfied ∀(i, j)εE, wherein w isan n-dimensional vector, v is real number controlling the trade offbetween the two terms, equation set Q_(ij) is${Q_{ij} \equiv \begin{Bmatrix}{{\gamma^{\mathbb{i}j} + {{K\left( {A^{i},A^{\prime}} \right)}v} + y^{i}} \geq 0} \\{{{\hat{\gamma}}^{\mathbb{i}j} - {{K\left( {A^{j},A^{\prime}} \right)}v} + y^{j}} \geq 0} \\{{\gamma^{\mathbb{i}j} + {\hat{\gamma}}^{\mathbb{i}j}} \leq {- 1}} \\{y^{i},{y^{j} \geq 0}}\end{Bmatrix}},$ wherein γ^(ij) and {circumflex over (γ)}^(ij) arederived by applying Farka's theorem to inequality conditionsw′A^(j)λ^(j)−w′A^(i)λ^(i)≦−1 on constraints λ^(j), λ^(i), respectively,wherein 0≦λ^(i,j)≦1, ∑λ^(𝕚, j) = 1, and K is an arbitrary kernelfunction.
 24. The computer readable program storage device of claim 22,wherein said linear inequality constraints are equalities represented by$Q_{ij} = {\begin{Bmatrix}{{{\gamma^{\mathbb{i}j} + {{K\left( {A^{i},A^{\prime}} \right)}v} + y^{i}} = 0},} \\{{{{\hat{\gamma}}^{\mathbb{i}j} - {{K\left( {A^{j},A^{\prime}} \right)}v} + y^{j}} = 0},} \\{{\gamma^{\mathbb{i}j} + {\hat{\gamma}}^{\mathbb{i}j}} = {- 1.}}\end{Bmatrix}.}$ wherein vε

is a weighting of said slack terms, γ^(ij) and {circumflex over(γ)}^(ij) are derived by applying Farka's theorem to equality conditionsw′A^(j)λ^(j)−w′A^(i)λ^(i)=−1 on constraints λ^(j), λ^(i), respectively,wherein 0≦λ^(i,j)≦1, ∑λ^(𝕚, j) = 1, and K is an arbitrary kernelfunction, and wherein said mathematical program is of form${\min\limits_{\{{v,{\gamma^{\mathbb{i}j}❘{{({i,j})} \in E}}}\}}{\frac{1}{2}{\sum\limits_{{({i,j})} \in E}^{\quad}\left\lbrack {{v\left( {{{{- \gamma^{\mathbb{i}j}} - {{K\left( {A^{i},A^{\prime}} \right)}v}}}_{2}^{2} + {{{\hat{\gamma}}^{\mathbb{i}j} + {{K\left( {A^{j},A^{\prime}} \right)}v}}}_{2}^{2}} \right)} + {\mu{{{\hat{\gamma}}^{\mathbb{i}j} + \gamma^{\mathbb{i}j} + 1}}_{2}^{2}}} \right\rbrack}}} + {v}_{2}^{2}$wherein με

is a weighting of the equality constraints.
 25. The computer readableprogram storage device of claim 24, the method further comprisingsolving said mathematical program by means of least squares.
 26. Thecomputer readable program storage device of claim 24, wherein μ isapproximately one.
 27. The computer readable program storage device ofclaim 19, wherein the number of sets is two, represented by A⁺ and A⁻,wherein A⁻

A⁺, and wherein the ranking function satisfies the constraints${{{w^{\prime}A^{\prime -}\lambda^{-}} - {w^{\prime}A^{\prime +}\lambda^{+}}} \leq {- 1}},{{for}\quad{all}\quad\left( {\lambda^{+},\lambda^{-}} \right)\quad{such}\quad{that}\quad\begin{Bmatrix}{{0 \leq \lambda^{+} \leq 1},{{\sum\lambda^{+}} = 1}} \\{{0 \leq \lambda^{-} \leq 1},{{\sum\lambda^{-}} = 1}}\end{Bmatrix}},$ wherein w is a vector in

^(n).
 28. The computer readable program storage device of claim 27,wherein A⁺ and A⁻ are non-separable, and wherein the ranking functionsatisfies w′A′⁻λ⁻−w′A′⁺λ⁺≦−1+(λ⁻y⁻+λ⁺y⁺), wherein y⁺, y⁻ are slackvariables greater than or equal to zero.
 29. The computer readableprogram storage device of claim 19, wherein said feature pointsrepresent tissue sample regions.
 30. The computer readable programstorage device of claim 29, the method further comprising using saidranking to determine a probability of said tissue sample being diseased.31. The computer readable program storage device of claim 29, the methodfurther comprising using said ranking to determine a malignancy ofdiseased tissue sample regions.
 32. The computer readable programstorage device of claim 29, wherein said tissue sample regions arederived from a plurality of patients, and further comprising using saidranking to sort said plurality of patients according to a predeterminedcriteria.
 33. The computer readable program storage device of claim 19,wherein said ordering of at least some of said training data sets isprovided by a physician.
 34. The computer readable program storagedevice of claim 19, wherein said training samples are assigned to setsbased on the results of a diagnostic test.
 35. The computer readableprogram storage device of claim 19, wherein said feature points arederived from a patient's electronic medical record.
 36. The computerreadable program storage device of claim 19, wherein said trainingsamples are assigned to sets by a physician.
 37. A method for finding aranking function ƒ that classifies feature points in an n-dimensionalspace, said feature points derived from a digital medical image whereinsaid feature points represent tissue sample regions, said methodcomprising the steps of: providing a plurality of feature points x_(k)in an n-dimensional space R^(n); providing training data A comprising aplurality of sets of training samples A^(j) wherein${A = {\bigcup_{j = 1}^{S}\left( {A^{j} = \left\{ x_{i}^{j} \right\}_{i = 1}^{m_{j}}} \right)}},$wherein S is a number of sets and a j^(th) set A^(j) includes m_(j)samples x_(i) ^(j); solving a mathematical optimization program todetermine said ranking function ƒ that classifies said feature points xinto said plurality of sets A, wherein for any two sets A^(i), A^(j),wherein A^(i)

A^(j), the ranking function ƒ is a linear function of the feature pointsx of the form w′x, wherein w is an n-dimensional vector, the rankingfunction satisfying inequality constraints ƒ(x_(i))≦ƒ(x_(j)) for allx_(i)εconv(A^(i)) and x_(i)εconv(A^(j)), wherein conv(A) represents theconvex hull of the elements of set A.
 38. The method of claim 37,further comprising providing an ordering E={(P,Q)|A^(P)

A^(Q)} of at least some of said training data sets wherein all trainingsamples x_(i)εA^(P) are ranked higher than any sample x_(i)εA^(Q).